166 research outputs found
A Stochastic Penalty Model for Convex and Nonconvex Optimization with Big Constraints
The last decade witnessed a rise in the importance of supervised learning
applications involving {\em big data} and {\em big models}. Big data refers to
situations where the amount of training data available and needed causes
difficulties in the training phase of the pipeline. Big model refers to
situations where large-dimensional and over-parameterized models are needed for
the application at hand. Both of these phenomena lead to a dramatic increase in
research activity aimed at taming the issues via the design of new
sophisticated optimization algorithms. In this paper we turn our attention to the
{\em big constraints} scenario and argue that elaborate machine learning
systems of the future will necessarily need to account for a large number of
real-world constraints, which will need to be incorporated in the training
process. This line of work is largely unexplored, and provides ample
opportunities for future work and applications. To handle the {\em big
constraints} regime, we propose a {\em stochastic penalty} formulation which
{\em reduces the problem to the well understood big data regime}. Our
formulation has many interesting properties which relate it to the original
problem in various ways, with mathematical guarantees. We give a number of
results specialized to nonconvex loss functions, smooth convex functions,
strongly convex functions and convex constraints. We show through experiments
that our approach can beat competing approaches by several orders of magnitude
when a medium-accuracy solution is required.
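To make the reduction concrete, a minimal sketch of the general idea, with a quadratic penalty on sampled constraints so that both data and constraints are subsampled just like in the big data regime, could look as follows; the loss, the penalty shape, the constants and all variable names are illustrative assumptions rather than the paper's formulation:

```python
import numpy as np

# Illustrative setup: least-squares loss with many linear inequality constraints
# a_j^T x <= b_j, handled through a sampled (stochastic) quadratic penalty.
rng = np.random.default_rng(0)
n, d, m = 1000, 20, 5000            # data points, dimension, number of constraints
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
A, b = rng.standard_normal((m, d)), rng.standard_normal(m)
lam, step = 10.0, 1e-3              # penalty weight and stepsize (assumptions)

x = np.zeros(d)
for k in range(20000):
    i = rng.integers(n)             # sample one data point (big data)
    j = rng.integers(m)             # sample one constraint (big constraints)
    g = (X[i] @ x - y[i]) * X[i]    # stochastic gradient of the loss term
    viol = max(A[j] @ x - b[j], 0.0)
    g += lam * 2.0 * viol * A[j]    # stochastic gradient of the penalty term
    x -= step * g
```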
On Optimal Probabilities in Stochastic Coordinate Descent Methods
We propose and analyze a new parallel coordinate descent method---`NSync---in
which at each iteration a random subset of coordinates is updated, in parallel,
allowing for the subsets to be chosen non-uniformly. We derive convergence
rates under a strong convexity assumption, and comment on how to assign
probabilities to the sets to optimize the bound. The complexity and practical
performance of the method can outperform its uniform variant by an order of
magnitude. Surprisingly, the strategy of updating a single randomly selected
coordinate per iteration---with optimal probabilities---may require fewer
iterations, both in theory and practice, than the strategy of updating all
coordinates at every iteration. Comment: 5 pages, 1 algorithm (`NSync), 2 theorems, 2 figures.
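As a rough illustration of the single-coordinate special case highlighted above, the sketch below runs coordinate descent on a quadratic with probabilities proportional to the coordinate-wise Lipschitz constants; the objective, this particular probability choice and the stepsizes are assumptions made for illustration, not the optimized probabilities derived in the paper:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T M x - c^T x; coordinate Lipschitz constants are M_ii.
rng = np.random.default_rng(1)
d = 50
B = rng.standard_normal((d, d))
M = B.T @ B + np.eye(d)             # positive definite Hessian
c = rng.standard_normal(d)
L = np.diag(M).copy()               # coordinate-wise Lipschitz constants

p = L / L.sum()                     # non-uniform probabilities (one simple choice)

x = np.zeros(d)
for k in range(20000):
    i = rng.choice(d, p=p)          # sample a single coordinate non-uniformly
    g_i = M[i] @ x - c[i]           # i-th partial derivative
    x[i] -= g_i / L[i]              # coordinate step with stepsize 1/L_i
# `NSync proper updates a whole random subset of coordinates in parallel per iteration.
```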
Linearly convergent stochastic heavy ball method for minimizing generalization error
In this work we establish the first linear convergence result for the
stochastic heavy ball method. The method performs SGD steps with a fixed
stepsize, amended by a heavy ball momentum term. In the analysis, we focus on
minimizing the expected loss and not on finite-sum minimization, which is
typically a much harder problem. While in the analysis we constrain ourselves
to quadratic loss, the overall objective is not necessarily strongly convex. Comment: NIPS 2017, Workshop on Optimization for Machine Learning (camera-ready version).
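A minimal sketch of the method being analyzed, SGD with a fixed stepsize amended by a heavy ball momentum term, on an expected quadratic loss in the interpolation regime, might look like this; the data model, stepsize and momentum constant are illustrative assumptions:

```python
import numpy as np

# Stochastic heavy ball on the expected loss E_i[0.5 * (a_i^T x - b_i)^2].
rng = np.random.default_rng(2)
n, d = 2000, 30
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                        # consistent targets: the expected loss is quadratic

step, beta = 0.005, 0.5               # fixed stepsize and momentum parameter (assumptions)
x = np.zeros(d)
x_prev = x.copy()
for k in range(30000):
    i = rng.integers(n)
    g = (A[i] @ x - b[i]) * A[i]      # stochastic gradient at the current iterate
    x_next = x - step * g + beta * (x - x_prev)   # SGD step amended by heavy ball momentum
    x_prev, x = x, x_next
```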
Semi-Stochastic Gradient Descent Methods
In this paper we study the problem of minimizing the average of a large
number ($n$) of smooth convex loss functions. We propose a new method, S2GD
(Semi-Stochastic Gradient Descent), which runs for one or several epochs in
each of which a single full gradient and a random number of stochastic
gradients are computed, following a geometric law. The total work needed for the
method to output an $\epsilon$-accurate solution in expectation, measured in
the number of passes over data, or equivalently, in units equivalent to the
computation of a single gradient of the loss, is
, where $\kappa$ is the condition number.
This is achieved by running the method for $\mathcal{O}(\log(1/\epsilon))$ epochs,
with a single full gradient evaluation and $\mathcal{O}(\kappa)$ stochastic gradient
evaluations in each. The SVRG method of Johnson and Zhang arises as a special
case. If our method is limited to a single epoch only, it needs to evaluate at
most stochastic gradients. In
contrast, SVRG requires stochastic gradients. To
illustrate our theoretical results, S2GD only needs the workload equivalent to
about 2.1 full gradient evaluations to find an $\epsilon$-accurate solution for
a problem with and . Comment: 19 pages, 3 figures, 2 algorithms, 3 tables.
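The epoch structure described above (one full gradient, then a random, geometrically distributed number of variance-reduced stochastic steps) can be sketched as follows; the ridge-regression objective, the stepsize, the cap on inner steps and the geometric parameter are illustrative assumptions, and the exact distribution and output rule used by S2GD differ in detail:

```python
import numpy as np

# S2GD-style epochs on a ridge-regression finite sum (illustrative problem).
rng = np.random.default_rng(3)
n, d, lam = 500, 20, 0.1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):                     # gradient of the i-th regularized loss
    return (A[i] @ x - b[i]) * A[i] + lam * x

def full_grad(x):
    return A.T @ (A @ x - b) / n + lam * x

h, m, nu = 0.02, 2 * n, 0.01          # stepsize, inner-loop cap, geometric parameter
y = np.zeros(d)
for epoch in range(30):
    mu = full_grad(y)                 # the single full gradient of the epoch
    t = min(rng.geometric(nu), m)     # random number of stochastic steps (geometric law)
    x = y.copy()
    for _ in range(t):
        i = rng.integers(n)
        x -= h * (grad_i(x, i) - grad_i(y, i) + mu)   # semi-stochastic (variance-reduced) step
    y = x
```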
Accelerated Gossip via Stochastic Heavy Ball Method
In this paper we show how the stochastic heavy ball method (SHB) -- a popular
method for solving stochastic convex and non-convex optimization problems
-- operates as a randomized gossip algorithm. In particular, we focus on two
special cases of SHB: the Randomized Kaczmarz method with momentum and its
block variant. Building upon a recent framework for the design and analysis of
randomized gossip algorithms [Loizou & Richt\'{a}rik, 2016], we explain the
distributed nature of the proposed methods. We present novel protocols for
solving the average consensus problem where in each step all nodes of the
network update their values but only a subset of them exchange their private
values. Numerical experiments on popular wireless sensor networks showing the
benefits of our protocols are also presented. Comment: 8 pages, 5 figures, 56th Annual Allerton Conference on Communication,
Control, and Computing, 2018.
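As a rough picture of how a Kaczmarz-type momentum method acts as a gossip protocol, the sketch below runs randomized Kaczmarz with heavy ball momentum on the consensus equations x_u = x_v over the edges of a ring: the two endpoints of a sampled edge exchange values and move toward their average, while every node applies its local momentum correction. The topology and the parameters omega and beta are illustrative assumptions:

```python
import numpy as np

# Randomized Kaczmarz with heavy ball momentum as a gossip protocol for average consensus.
rng = np.random.default_rng(4)
num_nodes = 20
edges = [(i, (i + 1) % num_nodes) for i in range(num_nodes)]   # ring network (assumption)

c = rng.standard_normal(num_nodes)    # private initial values held by the nodes
x, x_prev = c.copy(), c.copy()
omega, beta = 1.0, 0.3                # relaxation and momentum parameters (assumptions)

for k in range(5000):
    u, v = edges[rng.integers(len(edges))]     # activate one random edge
    delta = omega * (x[u] - x[v]) / 2.0        # Kaczmarz step for the equation x_u - x_v = 0
    x_next = x + beta * (x - x_prev)           # every node applies its momentum correction
    x_next[u] -= delta                         # only nodes u and v exchange their values
    x_next[v] += delta
    x_prev, x = x, x_next
# The node values approach the network average mean(c).
```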
Coordinate Descent Face-Off: Primal or Dual?
Randomized coordinate descent (RCD) methods are state-of-the-art algorithms
for training linear predictors via minimizing regularized empirical risk. When
the number of examples ($n$) is much larger than the number of features ($d$),
a common strategy is to apply RCD to the dual problem. On the other hand, when
the number of features is much larger than the number of examples, it makes
sense to apply RCD directly to the primal problem. In this paper we provide the
first joint study of these two approaches when applied to L2-regularized ERM.
First, we show through a rigorous analysis that for dense data, the above
intuition is precisely correct. However, we find that for sparse and structured
data, primal RCD can significantly outperform dual RCD even if $n \gg d$, and
vice versa, dual RCD can be much faster than primal RCD even if $d \gg n$.
Moreover, we show that, surprisingly, a single sampling strategy minimizes both
the (bound on the) number of iterations and the overall expected complexity of
RCD. Note that the latter complexity measure also takes into account the
average cost of the iterations, which depends on the structure and sparsity of
the data, and on the sampling strategy employed. We confirm our theoretical
predictions using extensive experiments with both synthetic and real data sets.
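The role of data structure in the iteration cost can be seen already in a minimal primal RCD sketch for ridge regression (the squared loss and uniform sampling are assumptions; the paper covers general L2-regularized ERM and non-uniform samplings): by maintaining the residual, one coordinate update costs on the order of the number of nonzeros in the chosen feature column, which is exactly the kind of quantity the expected-complexity comparison has to account for.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Primal RCD for ridge regression: min_w 0.5/n * ||X w - y||^2 + 0.5 * lam * ||w||^2.
rng = np.random.default_rng(5)
n, d, lam = 1000, 200, 0.1
X = sparse_random(n, d, density=0.01, random_state=5, format="csc")
y = rng.standard_normal(n)

col_sqnorm = np.asarray(X.multiply(X).sum(axis=0)).ravel()
L = col_sqnorm / n + lam              # coordinate-wise Lipschitz constants

w = np.zeros(d)
r = -y.copy()                         # residual X w - y, maintained incrementally
for k in range(20000):
    j = rng.integers(d)               # uniform coordinate sampling (an assumption)
    lo, hi = X.indptr[j], X.indptr[j + 1]
    idx, vals = X.indices[lo:hi], X.data[lo:hi]      # nonzeros of feature column j
    g = vals @ r[idx] / n + lam * w[j]               # partial derivative w.r.t. w_j
    delta = -g / L[j]
    w[j] += delta
    r[idx] += delta * vals            # cost proportional to nnz of column j
```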
Nonconvex Variance Reduced Optimization with Arbitrary Sampling
We provide the first importance sampling variants of variance reduced
algorithms for empirical risk minimization with non-convex loss functions. In
particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods
have the capacity to speed up the training process by an order of magnitude
compared to the state of the art on real datasets. Moreover, we also improve
upon current mini-batch analysis of these methods by proposing importance
sampling for minibatches in this setting. Surprisingly, our approach can in
some regimes lead to superlinear speedup with respect to the minibatch size,
which is not usually present in stochastic optimization. All the above results
follow from a general analysis of the methods which works with arbitrary
sampling, i.e., a fully general randomized strategy for the selection of subsets
of examples to be sampled in each iteration. Finally, we also perform a novel
importance sampling analysis of SARAH in the convex setting. Comment: 9 pages, 12 figures, 25 pages of supplementary material.
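A hedged sketch of one of the analyzed schemes, an SVRG-style nonconvex loop with importance sampling, is given below; the nonconvex loss, the stepsize, the inner-loop length and the choice of probabilities proportional to per-example squared norms are illustrative stand-ins, not the quantities prescribed by the paper's analysis:

```python
import numpy as np

# Nonconvex SVRG with importance sampling on f(w) = (1/n) sum_i phi(a_i^T w - b_i),
# where phi(t) = t^2 / (1 + t^2) is a smooth nonconvex robust loss (illustrative).
rng = np.random.default_rng(6)
n, d = 800, 25
A = rng.standard_normal((n, d)) * rng.uniform(0.1, 3.0, size=(n, 1))  # uneven example norms
b = rng.standard_normal(n)

def phi_prime(t):
    return 2.0 * t / (1.0 + t**2) ** 2

def grad_i(w, i):                     # gradient of the i-th (unaveraged) loss
    return phi_prime(A[i] @ w - b[i]) * A[i]

def full_grad(w):
    return (phi_prime(A @ w - b)[:, None] * A).mean(axis=0)

p = np.sum(A**2, axis=1)
p /= p.sum()                          # importance sampling probabilities (an assumption)
eta, m = 0.05, n                      # stepsize and inner-loop length (assumptions)

x = rng.standard_normal(d)
for epoch in range(20):
    y = x.copy()
    mu = full_grad(y)                 # snapshot (full) gradient
    for _ in range(m):
        i = rng.choice(n, p=p)
        # unbiased variance-reduced gradient estimator under non-uniform sampling
        g = (grad_i(x, i) - grad_i(y, i)) / (n * p[i]) + mu
        x -= eta * g
```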
One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods
We propose a remarkably general variance-reduced method suitable for solving
regularized empirical risk minimization problems with either a large number of
training examples, or a large model dimension, or both. In special cases, our
method reduces to several known methods previously thought to be unrelated,
such as {\tt SAGA}, {\tt LSVRG}, {\tt JacSketch}, {\tt SEGA} and {\tt ISEGA},
and their arbitrary sampling and proximal generalizations. However, we also
highlight a large number of new specific algorithms with interesting
properties. We provide a single theorem establishing linear convergence of the
method under smoothness and quasi strong convexity assumptions. With this
theorem we recover best-known and sometimes improved rates for known methods
arising in special cases. As a by-product, we provide the first unified method
and theory for stochastic gradient and stochastic coordinate descent type
methods. Comment: 61 pages, 6 figures, 3 tables.
Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory
We develop a family of reformulations of an arbitrary consistent linear
system into a stochastic problem. The reformulations are governed by two
user-defined parameters: a positive definite matrix defining a norm, and an
arbitrary discrete or continuous distribution over random matrices. Our
reformulation has several equivalent interpretations, allowing researchers
from various communities to leverage their domain-specific insights. In
particular, our reformulation can be equivalently seen as a stochastic
optimization problem, stochastic linear system, stochastic fixed point problem
and a probabilistic intersection problem. We prove sufficient, and necessary
and sufficient conditions for the reformulation to be exact. Further, we
propose and analyze three stochastic algorithms for solving the reformulated
problem---basic, parallel and accelerated methods---with global linear
convergence rates. The rates can be interpreted as condition numbers of a
matrix which depends on the system matrix and on the reformulation parameters.
This gives rise to a new phenomenon which we call stochastic preconditioning,
and which refers to the problem of finding parameters (matrix and distribution)
leading to a sufficiently small condition number. Our basic method can be
equivalently interpreted as stochastic gradient descent, stochastic Newton
method, stochastic proximal point method, stochastic fixed point method, and
stochastic projection method, with fixed stepsize (relaxation parameter),
applied to the reformulations. Comment: Accepted to SIAM Journal on Matrix Analysis and Applications. This
arXiv version has an additional section (Section 6.2), listing several
extensions done since the paper was first written. Statistics: 39 pages, 4
reformulations, 3 algorithms.
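As one concrete instance of the basic method, in its stochastic projection interpretation and specialized (as an assumption) to the Euclidean norm and single-row sketches, the iteration reduces to a relaxed randomized Kaczmarz step:

```python
import numpy as np

# Basic method for a consistent system A x = b, specialized to B = I and
# single-row sketches, i.e. randomized Kaczmarz with a fixed relaxation parameter.
rng = np.random.default_rng(7)
m, d = 300, 50
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d)
b = A @ x_true                        # consistent by construction

row_sqnorm = np.sum(A**2, axis=1)
p = row_sqnorm / row_sqnorm.sum()     # row-sampling distribution (one common choice)
omega = 1.0                           # fixed stepsize / relaxation parameter

x = np.zeros(d)
for k in range(5000):
    i = rng.choice(m, p=p)
    residual = A[i] @ x - b[i]
    # relaxed projection of x onto the hyperplane {z : A[i] z = b[i]}
    x -= omega * residual / row_sqnorm[i] * A[i]
```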
Accelerated Coordinate Descent with Arbitrary Sampling and Best Rates for Minibatches
Accelerated coordinate descent is a widely used optimization algorithm due
to its efficiency on large-dimensional problems. It achieves state-of-the-art
complexity on an important class of empirical risk minimization problems. In
this paper we design and analyze an accelerated coordinate descent (ACD) method
which in each iteration updates a random subset of coordinates according to an
arbitrary but fixed probability law, which is a parameter of the method. If all
coordinates are updated in each iteration, our method reduces to the classical
accelerated gradient descent method AGD of Nesterov. If a single coordinate is
updated in each iteration, and we pick probabilities proportional to the square
roots of the coordinate-wise Lipschitz constants, our method reduces to the
currently fastest coordinate descent method NUACDM of Allen-Zhu, Qu,
Richt\'{a}rik and Yuan.
While mini-batch variants of ACD are more popular and relevant in practice,
there is no importance sampling for ACD that outperforms the standard uniform
mini-batch sampling. Through insights enabled by our general analysis, we
design new importance sampling for mini-batch ACD which significantly
outperforms previous state-of-the-art minibatch ACD in practice. We prove a
rate that is at most times worse than the rate of
minibatch ACD with uniform sampling, but can be times
better, where $\tau$ is the minibatch size. Since in modern supervised learning
training systems it is standard practice to choose $\tau \ll n$, and often
$\tau = \mathcal{O}(1)$, our method can lead to dramatic speedups. Lastly, we obtain
similar results for minibatch nonaccelerated CD as well, achieving improvements
on previous best rates. Comment: 28 pages, 108 figures.
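For orientation, the full-update special case mentioned above, Nesterov's accelerated gradient descent on a strongly convex quadratic, is sketched below with illustrative constants; the ACD method of the paper replaces the full gradient step by an update of a randomly sampled coordinate subset, with probabilities and stepsizes chosen so that acceleration is preserved:

```python
import numpy as np

# Nesterov's AGD on a strongly convex quadratic f(x) = 0.5 * x^T M x - c^T x
# (the special case of ACD in which every coordinate is updated each iteration).
rng = np.random.default_rng(8)
d = 60
B = rng.standard_normal((d, d))
M = B.T @ B + 0.1 * np.eye(d)
c = rng.standard_normal(d)

eigvals = np.linalg.eigvalsh(M)
L, mu = eigvals[-1], eigvals[0]       # smoothness and strong convexity constants
beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)   # momentum for the strongly convex case

y = np.zeros(d)                       # sequence produced by the gradient steps
x = y.copy()                          # extrapolated sequence
for k in range(300):
    grad = M @ x - c
    y_next = x - grad / L             # gradient step taken from the extrapolated point
    x = y_next + beta * (y_next - y)  # extrapolation (momentum) step
    y = y_next
```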